Comparative and phylogenetic analysis of the complete chloroplast genome sequences of Allium mongolicum

Allium mongolicum Regel is a wild vegetable of sandy habitats with a unique flavour. In this study, the complete chloroplast (cp) genome of A. mongolicum was obtained (GenBank accession number: OM630416); it comprised 153,609 base pairs with a GC content of 36.8%. A total of 130 genes were annotated, including 84 protein-coding genes, 38 tRNA genes, and 8 rRNA genes. The large single-copy (LSC) region was 82,644 bp and the small single-copy (SSC) region was 18,049 bp; they were separated by two inverted repeats (IRs, IRa and IRb) of 26,458 bp each. Comparative analyses of 55 Allium species suggested that the genomic structure of the genus Allium was conserved, and that the LSC and SSC regions were the most variable. Among them, the most divergent loci were in the SSC region, covering ycf1-rrn4.5 and ndhF-ccsA. Phylogenetic analysis of the cp genomes of 55 Allium species grouped all members into 13 clades, and A. mongolicum was closely related to A. senescens. Corresponding analyses of four protein-coding genes (ycf1, ndhF, rpl32, and ccsA) in the aforementioned divergent loci led to ycf1 being chosen as the candidate gene for species identification and evolutionary classification in the genus Allium. These data provide valuable genetic resources for future research on Allium.

Allium species. For example, gene arrangement and variation were assessed in sixteen cp genomes from Cepa, and a phylogeny of Cepa was constructed to explain the domestication of the common onion 19. Xie et al. analysed the structure of six cp genomes and uncovered selective pressure in Daghestanica 20. In addition, Xie et al. collected 39 Allium cp genomes and showed that the three traditional evolutionary lineages presumably diverged between the early Eocene and the middle Miocene 21. However, few genes have been identified that reliably reflect the classification and evolution of species in the genus Allium. Hence, phylogenetic relationships across the genus as a whole need further study.
Here, high-throughput sequencing and bioinformatics were used to assemble the cp genome sequence of A. mongolicum (GenBank accession number: OM630416). It was compared with 54 published cp genomes from the genus Allium to explore the structure and evolution of the A. mongolicum chloroplast genome. Subsequently, phylogenomic analyses were performed on these 55 chloroplast genomes to identify a suitable gene for identifying and classifying members of the genus Allium. Our study provides a reference for studying genetic diversity and phylogenetic relationships in the genus Allium.

Results
Basic characteristics of the Allium mongolicum chloroplast genome. The extracted DNA was assessed, and 8.98 Gbp of raw data were used for subsequent analysis. The finished cp genome was submitted to GenBank under accession number OM630416. The A. mongolicum cp genome was 153,609 bp long and consisted of a large single-copy (LSC) region of 82,644 bp and a small single-copy (SSC) region of 18,049 bp, separated by two inverted repeats (IRs, IRa and IRb) of 26,458 bp each (Fig. 1). The overall GC (guanine and cytosine) content of the chloroplast genome was 36.8%. A total of 130 genes were annotated, including 84 protein-coding genes, 38 tRNA genes, and eight rRNA genes (Table 1). Twenty-one chloroplast genes harboured introns: 19 contained a single intron, and two (ycf3 and clpP) contained two introns.
To characterize sequence divergence among the 55 Allium cp genomes, nine representative species were selected and compared using the mVISTA program, with the A. altaicum genome as the reference. The results showed that cp genomes were relatively conserved across all Allium species examined (Fig. 2). No notable differences in gene order were detected between the A. mongolicum cp genome and those of related Allium species. In general, the highly divergent regions among the representative Allium cp genomes occurred mainly in the LSC and SSC regions, and coding regions showed less variation than non-coding regions. At the species level, high similarity and low divergence were detected among the cp genomes of A. altaicum, A. cepa and A. mongolicum; the same was observed within the group comprising A. caeruleum and A. ampeloprasum, and within the group comprising A. trifurcatum, A. fetisowii, A. nerinifolium and A. nanodes.
To identify polymorphic genes that could be used for species identification, we calculated nucleotide diversity (Pi) across the aligned sequences in a 600-bp sliding window (Fig. 3). Pi values ranged from 0 to 0.06769; the highest values were found in the SSC region, followed by the LSC region. The average Pi values of the SSC and LSC regions were 0.0456 and 0.0219, respectively, whereas that of the IR regions was 0.0052 (Table S1). These differences showed that the two IR regions were more conserved than the LSC and SSC regions. Most of the variation among Allium cp genomes was located in the SSC region, where two hypervariable regions (Pi > 0.06) stood out: ycf1-rrn4.5 (0.06052-0.06769) and ndhF-ccsA (0.06195-0.06516). The ycf1-rrn4.5 region contained ycf1, tRNA-Asn, tRNA-Arg, rrn5 and rrn4.5, and the ndhF-ccsA region comprised ndhF, rpl32, tRNA-Leu and ccsA. Among these genes, four protein-coding genes (ycf1, ndhF, rpl32, and ccsA) stood out against the general conservation of the genome. Thus, these polymorphic regions might be novel candidate fragments for population genetic studies of Allium species.
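The sliding-window calculation of Pi can be illustrated with a short Python sketch. It is not the DnaSP implementation; it assumes an already aligned set of equal-length sequences, treats Pi as the mean proportion of pairwise differences per site, and skips any alignment column that contains a gap or ambiguous base (an assumption made here for simplicity). The function names and the toy alignment are hypothetical; the defaults of 600 bp and 200 bp follow the window length and step size given in the Materials and methods.

```python
from collections import Counter


def column_diversity(column):
    """Mean pairwise difference at one alignment column.

    Returns None when the column holds a gap or ambiguous base
    (assumption of this sketch: such columns are skipped).
    """
    column = [base.upper() for base in column]
    if any(base not in "ACGT" for base in column):
        return None
    n = len(column)
    counts = Counter(column)
    # Number of sequence pairs that differ at this site.
    differing_pairs = (n * n - sum(c * c for c in counts.values())) / 2
    total_pairs = n * (n - 1) / 2
    return differing_pairs / total_pairs


def sliding_pi(alignment, window=600, step=200):
    """Return (window midpoint, Pi) tuples along an alignment.

    `alignment` is a list of at least two equal-length aligned sequences.
    """
    length = len(alignment[0])
    per_site = [
        column_diversity([seq[i] for seq in alignment]) for i in range(length)
    ]
    results = []
    for start in range(0, length - window + 1, step):
        valid = [v for v in per_site[start:start + window] if v is not None]
        pi = sum(valid) / len(valid) if valid else 0.0
        results.append((start + window // 2, pi))
    return results


if __name__ == "__main__":
    # Hypothetical toy alignment; real input would be the MAFFT alignment.
    toy = ["AAAA", "AAAT", "AATT"]
    for midpoint, pi in sliding_pi(toy, window=2, step=1):
        print(midpoint, round(pi, 4))
```

Computing diversity once per column and then averaging within each window avoids recomparing every sequence pair for every overlapping window.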
Codon usage. Codon usage links mRNA to protein and is an essential characteristic of gene expression in plant genomes 12. We estimated codon usage frequency in detail for all protein-coding sequences in A. mongolicum. Leucine (L), arginine (R) and serine (S) were the most frequently encoded amino acids, and methionine (M) and tryptophan (W) were the least frequent (Fig. 4). Next, codon usage frequency was calculated in detail for 51 grouped Allium cp genomes; their protein-coding genes were encoded by 19,832 (A. przewalskianum) to 26,802 (A. fistulosum) codons (Table S2). As in other monocots, UGA, UAG, and UAA served as termination codons. Among these Allium species, UUA, encoding leucine (Leu), had the highest relative synonymous codon usage (RSCU) value, approximately 2.23 in A. przewalskianum, and CUG, also encoding leucine, had the lowest RSCU value, approximately 0.29 in A. przewalskianum. As in A. mongolicum, methionine and tryptophan were used least, whereas only leucine was used most. In addition, 30 codons had RSCU values > 1 and 32 had values < 1. Similar to other monocot chloroplast genomes, the frequency of G or C in Allium species was lower than that of A or T across whole codons as well as at the third codon position. This preference is considered to be a combined result of gene expression, natural selection, and speciation mechanisms 19.
Table 1. Genes contained in the Allium mongolicum chloroplast genome (columns: Category, Groups of genes, Name of genes). a Intron-containing genes. b Genes located in the IR regions.

Contraction and expansion of IR regions. It is well known that cp genomes are usually well conserved, with a typical quadripartite structure and a specific gene order. Although the diversity of cp genomes depends mainly on the more divergent LSC and SSC regions, size variation among cp genomes is strongly linked to the contraction and expansion of the two IR regions, which can reflect species evolution. Therefore, the IR boundaries of nine representative members were examined to explain differences in Allium cp genome size, including the boundaries of LSC/IRb regions (JLB), the boundaries of SSC/IRb regions (JSB), boundaries of SSC/

Phylogenetic analysis. To determine the evolutionary position of A. mongolicum and the genetic clusters within the genus Allium, a phylogenetic tree of 55 Allium members was constructed. Phylogenetic analysis of the cp genomes suggested that all members clustered into 13 clades, and the three outgroup species formed a single clade (Fig. 6). Among them, A. mongolicum, sequenced in this study, was closely related to A. senescens.
To identify a suitable gene for assessing the phylogeny of Allium species on the basis of the aforementioned Pi values, we constructed phylogenetic trees from the four most variable protein-coding genes (ycf1, ndhF, rpl32, and ccsA). The resulting trees differed from one another in several major clades, but the cladogram for ycf1 was consistent with that of the full-length cp genomes (Fig. 7).

Discussion
Allium mongolicum is distributed in desert areas, shows strong drought resistance, and is an uncommon Allium species of the family Amaryllidaceae. Unlike other Allium plants, A. mongolicum grows mainly in barren land within a restricted range, and related research is lacking. As an essential member of the genus Allium, it plays a critical role in explaining evolutionary relationships within the genus. In this study, we sequenced the cp genome of A. mongolicum (153,609 bp), which exhibited the typical quadripartite structure and contained 130 genes: 84 protein-coding genes, 38 tRNAs and 8 rRNAs. Across the 55 Allium cp genomes compared, genome structure, length and organization were highly conserved. The LSC and SSC regions were more divergent than the two IR regions, and the analysis of nucleotide diversity confirmed this view. We further found that the most variable loci were in the SSC region, comprising ycf1-rrn4.5 and ndhF-ccsA. The traditional taxonomy of Allium is based mostly on plant morphology, such as tillering characteristics and pseudo-stem morphology, and the taxonomic status of the genus has long changed frequently. In this study, phylogenetic analysis of full-length cp genomes separated the 55 Allium members into 13 clades. In addition, members of Group VI, including A. mongolicum, were more closely related to, and shared more similar gene contents with, members of Group VII. The 13 clades obtained from the phylogenetic analyses of cp genomes accorded with modern taxonomic classification. Furthermore, detailed evolutionary analyses of the four protein-coding genes from the two key loci mentioned above were performed, and ycf1 was chosen as the candidate gene for species identification and evolutionary classification in the genus Allium.
Comparative analysis of the Allium chloroplast genomes showed that genome size ranged widely, from 145 to 154 kb, mainly owing to differences at the JSB and JSA junctions. The two ycf1 genes at the IR boundaries were crucial, as they span the junctions between the single-copy and inverted repeat regions. Specifically, the cp genome of A. monanthum was the longest, and its two ycf1 genes were longer, at 1130 bp and 5309 bp, respectively. By contrast, the two ycf1 genes of A. paradoxum were 521 bp and 5249 bp, respectively, resulting in the shortest cp genome. This is consistent with comparative chloroplast genomics studies of the genus Taxodium 22; changes in the length of the two genes drove the contraction and expansion of the two IR regions, which directly affected the size of the chloroplast genome.
Sequence divergence and diversity in Allium were further analysed in several ways in this study, and all results indicated that the single-copy regions were more variable than the two IR regions. Interestingly, ycf1-rrn4.5 and ndhF-ccsA in the SSC region were the sites where diversity was most concentrated across the whole chloroplast genome of the genus Allium. These two sites contained four protein-coding genes (ycf1, ndhF, rpl32, and ccsA) and five RNA genes (tRNA-Leu, tRNA-Asn, tRNA-Arg, rrn5 and rrn4.5). It has been reported that rpoC2 can serve as a gene of choice in Allium phylogeographical studies, but that result was based on a phylogenetic tree of only 17 members of the genus. Hence, identifying the best candidate gene for evolutionary analysis of the genus Allium remained an urgent need. The phylogenetic tree of 55 Allium species was constructed from chloroplast genome sequences, and further evolutionary analyses of four variable protein-coding genes in all species were performed. The results suggested that the evolutionary relationships inferred from ycf1 were consistent with the tree based on the full-length cp genomes. Hence, ycf1 was the best choice for assessing the classification of Allium species.

Conclusions
In this study, we report the complete chloroplast genome of A. mongolicum, which contains 130 genes. Compared with other members of the genus Allium, the size, structure and gene content of the A. mongolicum chloroplast genome were conserved. In total, 55 Allium species were used for comparative analysis, and the phylogeny based on full-length genome sequences agreed with that based on the highly variable ycf1 gene. Hence, ycf1 can be used to evaluate phylogenetic relationships in Allium. The molecular data in this study provide a valuable resource for studying evolution in the genus Allium.

Materials and methods
Sampling, DNA extraction. Allium
Genome comparison. Fifty-four published cp genomes of the genus Allium were downloaded from the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov); detailed information is listed in Table 2. Divergence among nine representative genomes was analysed with mVISTA (https://genome.lbl.gov/vista/index.shtml) in Shuffle-LAGAN mode 25,26. MAFFT was used to align the cp genomes of all Allium species 27. Pi across all cp genomes was calculated using DnaSP6 28, and the results were presented as a sliding-window analysis with a window length of 600 bp and a step size of 200 bp. IR boundaries and the contraction and expansion of cp genomes were visualized with IRscope 29.
Codon usage analysis. We used codonW 1.4.4 to obtain RSCU values for evaluating codon preference 30.
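RSCU for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally, so values above 1 indicate preferred codons. The following Python sketch illustrates that definition only; it is not codonW, the function names are hypothetical, and it assumes in-frame DNA coding sequences (T in place of U) interpreted with the standard codon assignments.

```python
from collections import Counter, defaultdict

# Standard codon assignments in TCAG order (assumption of this sketch).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def count_codons(coding_sequences):
    """Count in-frame codons, ignoring any that contain non-ACGT symbols."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    return counts


def rscu(coding_sequences):
    """Return {codon: RSCU} for every amino acid observed at least once."""
    counts = count_codons(coding_sequences)
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values


if __name__ == "__main__":
    # Hypothetical toy coding sequence: ATG TTA TTA TTA CTG
    for codon, value in sorted(rscu(["ATGTTATTATTACTG"]).items()):
        print(codon, value)
```

Amino acids with a single codon, such as methionine and tryptophan, always receive an RSCU of 1 under this definition, which is why they do not appear among the codons above or below 1.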

Phylogenetic analysis. A phylogenetic tree of 55 members of the genus Allium was constructed in MEGA-X using the maximum likelihood (ML) method 31. Three related species were used as the outgroup: Lilium brownii (MK493294), Ophiopogon bodinieri (NC_051508) and Polygonatum kingianum (MW373520). Phylogenetic trees of the four target genes (ndhF, rpl32, ccsA and ycf1) were constructed in the same way.

Data availability
The datasets generated during the current study are available in the NCBI repository (GenBank accession number: OM630416).